Abstract

This paper provides the best bounds to date on the number of randomly sampled entries required to reconstruct 
an unknown low rank matrix. These results improve on prior work by Candes and Recht [4], Candes and Tao [7], 
and Keshavan, Montanari, and Oh [18]. The reconstruction is accomplished by minimizing the nuclear norm, or 
sum of the singular values, of the hidden matrix subject to agreement with the provided entries. If the underlying 
matrix satisfies a certain incoherence condition, then the number of entries required is equal to a quadratic logarithmic 
factor times the number of parameters in the singular value decomposition. The proof of this assertion is short, self-contained,
and uses very elementary analysis. The novel techniques herein are based on recent work in quantum
information theory. 

Keywords. Matrix completion, low-rank matrices, convex optimization, nuclear norm minimization, random 
matrices, operator Chernoff bound, compressed sensing. 

1 Introduction 

Recovering a low rank matrix from a given subset of its entries is a recurring problem in collaborative filtering [25],
dimensionality reduction [20, 28], and multi-class learning [2, 22]. While a variety of heuristics have been developed
across many disciplines, the general problem of finding the lowest rank matrix satisfying equality constraints
is NP-hard. All known algorithms which can compute the lowest rank solution for all instances require time at least
exponential in the dimensions of the matrix in both theory and practice [9].

In sharp contrast to such worst case pessimism, Candes and Recht showed that most low rank matrices could be 
recovered from most sufficiently large sets of entries by computing the matrix of minimum nuclear norm that agreed 
with the provided entries [4], and furthermore the revealed set of entries could comprise a vanishing fraction of the 
entire matrix. The nuclear norm is equal to the sum of the singular values of a matrix and is the best convex lower 
bound of the rank function on the set of matrices whose singular values are all bounded by 1. The intuition behind 
this heuristic is that whereas the rank function counts the number of nonvanishing singular values, the nuclear norm 
sums their amplitude, much like how the $\ell_1$ norm is a useful surrogate for counting the number of nonzeros in a vector.
Moreover, the nuclear norm can be minimized subject to equality constraints via semidefinite programming. 

Nuclear norm minimization had long been observed to produce very low-rank solutions in practice (see, for example,
[3, 11, 12, 21, 26]), but only very recently was there any theoretical basis for when it produced the minimum rank
solution. The first paper to provide such foundations was [24], where Recht, Fazel, and Parrilo developed probabilistic
techniques to study average case behavior and showed that the nuclear norm heuristic could solve most instances of
the rank minimization problem assuming the number of linear constraints was sufficiently large. The results in [24]
inspired a groundswell of interest in theoretical guarantees for rank minimization, and these results laid the foundation
for [4]. Candes and Recht's bounds were subsequently improved by Candes and Tao [7] and Keshavan, Montanari,
and Oh [18] to show that one could, in special cases, reconstruct a low-rank matrix by observing a set of entries of size
at most a polylogarithmic factor larger than the intrinsic dimension of the variety of rank $r$ matrices.
This paper sharpens the results in [7, 18] to provide a bound on the number of entries required to reconstruct a low
rank matrix which is optimal up to a small numerical constant and one logarithmic factor. The main theorem makes 
minimal assumptions about the low rank matrix of interest. Moreover, the proof is very short and relies on mostly 
elementary analysis. 

In order to precisely state the main result, we need one definition. Candes and Recht observed that it is impossible 
to recover a matrix which is equal to zero in nearly all of its entries unless all of the entries of the matrix are observed 
(consider, for example, the rank one matrix which is equal to 1 in one entry and zeros everywhere else). In other 
words, the matrix cannot be mostly equal to zero on the observed entries. This motivated the following definition 

Definition 1.1 Let $U$ be a subspace of $\mathbb{R}^n$ of dimension $r$ and $P_U$ be the orthogonal projection onto $U$. Then the
coherence of $U$ (vis-à-vis the standard basis $(e_i)$) is defined to be

$$\mu(U) = \frac{n}{r} \max_{1 \le i \le n} \|P_U e_i\|^2. \tag{1.1}$$

Note that for any subspace, the smallest $\mu(U)$ can be is 1, achieved, for example, if $U$ is spanned by vectors whose
entries all have magnitude $1/\sqrt{n}$. The largest possible value for $\mu(U)$ is $n/r$ which would correspond to any subspace
that contains a standard basis element. If a matrix has row and column spaces with low coherence, then each entry can
be expected to provide about the same amount of information.
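
To make the two extremes of the coherence concrete, here is a minimal Python sketch that evaluates $\mu(U)$ from an orthonormal basis. The function name `coherence` and the two example subspaces are illustrative assumptions, not part of the original argument. The sketch uses the identity $\|P_U e_i\|^2 = \sum_k (u_k)_i^2$, which holds whenever the $u_k$ form an orthonormal basis of $U$.

```python
import math

def coherence(basis):
    """Coherence mu(U) = (n / r) * max_i ||P_U e_i||^2 of a subspace U.

    `basis` is a list of r orthonormal vectors (each a list of n floats)
    spanning U. For an orthonormal basis, ||P_U e_i||^2 is the sum of the
    squared i-th coordinates of the basis vectors.
    """
    r = len(basis)
    n = len(basis[0])
    leverage = [sum(u[i] ** 2 for u in basis) for i in range(n)]
    return (n / r) * max(leverage)

n = 16
flat = [[1.0 / math.sqrt(n)] * n]      # all entries have magnitude 1/sqrt(n)
spiky = [[1.0] + [0.0] * (n - 1)]      # the subspace contains e_1

mu_flat = coherence(flat)      # smallest possible coherence, equal to 1
mu_spiky = coherence(spiky)    # largest possible coherence, n / r = 16
```

The flat subspace spreads its mass evenly over the coordinates, while the spiky one concentrates it on a single coordinate; this is exactly the distinction the definition is meant to capture.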

Recall that the nuclear norm of an $n_1 \times n_2$ matrix $X$ is the sum of the singular values of $X$, $\|X\|_* = \sum_{k=1}^{\min\{n_1, n_2\}} \sigma_k(X)$,
where, here and below, $\sigma_k(X)$ denotes the $k$th largest singular value of $X$. The main result of this paper is the
following.

Theorem 1.1 Let $M$ be an $n_1 \times n_2$ matrix of rank $r$ with singular value decomposition $U \Sigma V^*$. Without loss of
generality, impose the conventions $n_1 \le n_2$, $\Sigma$ is $r \times r$, $U$ is $n_1 \times r$ and $V$ is $n_2 \times r$. Assume that

A0 The row and column spaces have coherences bounded above by some positive $\mu_0$.

A1 The matrix $UV^*$ has a maximum entry bounded by $\mu_1 \sqrt{r/(n_1 n_2)}$ in absolute value for some positive $\mu_1$.

Suppose $m$ entries of $M$ are observed with locations sampled uniformly at random. Then if

$$m \ge 32 \max\{\mu_1^2, \mu_0\}\, r (n_1 + n_2)\, \beta \log^2(2 n_2) \tag{1.2}$$

for some $\beta > 1$, the minimizer to the problem

$$\begin{array}{ll} \text{minimize} & \|X\|_* \\ \text{subject to} & X_{ij} = M_{ij} \quad (i,j) \in \Omega \end{array} \tag{1.3}$$

is unique and equal to $M$ with probability at least $1 - 6 \log(n_2)(n_1 + n_2)^{2 - 2\beta} - n_2^{2 - 2\beta^{1/2}}$.

The assumptions A0 and A1 were introduced in [4]. Both $\mu_0$ and $\mu_1$ may depend on $r$, $n_1$, or $n_2$. Moreover,
note that $\mu_1 \le \mu_0 \sqrt{r}$ by the Cauchy-Schwarz inequality. As shown in [4], both subspaces selected from the uniform
distribution and spaces constructed as the span of singular vectors with bounded entries are not only incoherent with
the standard basis, but also obey A1 with high probability for values of $\mu_1$ at most logarithmic in $n_1$ and/or $n_2$.
Applying this theorem to the models studied in Section 2 of [4], we find that there is a numerical constant $c_u$ such
that $c_u r (n_1 + n_2) \log^5(n_2)$ entries are sufficient to reconstruct a rank $r$ matrix whose row and column spaces are
sampled from the Haar measure on the Grassmann manifold. If $r \ge \log(n_2)$, the number of entries can be reduced
to $c_u r (n_1 + n_2) \log^4(n_2)$. Similarly, there is a numerical constant $c_i$ such that $c_i \mu_0^2 r (n_1 + n_2) \log^3(n_2)$ entries are
sufficient to recover a matrix of arbitrary rank $r$ whose singular vectors have entries with magnitudes bounded by
$\sqrt{\mu_0 / n_1}$.
Theorem 1.1 greatly improves upon prior results. First of all, it has the weakest assumptions on the matrix to be
recovered. In addition to assumption A1, Candes and Tao require a "strong incoherence condition" (see [7]) which is
considerably more restrictive than the assumption A0 in Theorem 1.1. Many of their results also require restrictions
on the rank of $M$, and their bounds depend superlinearly on $\mu_0$. Keshavan et al. require the matrix rank to be no more
than $\log(n_2)$, and require bounds on the maximum magnitude of the entries in $M$ and the ratios $\sigma_1(M)/\sigma_r(M)$ and
$n_2/n_1$. Theorem 1.1 makes no such assumptions about the rank, aspect ratio, or condition number of $M$. Moreover,
(1.2) has a smaller log factor than [7], and features numerical constants that are both explicit and small.

Also note that there is not much room for improvement in the bound for $m$. It is a consequence of the coupon
collector's problem that at least $n_2 \log n_2$ uniformly sampled entries are necessary just to guarantee that at least one
entry in every row and column is observed with high probability. In addition, rank $r$ matrices have $r(n_1 + n_2 - r)$
parameters, a fact that can be verified by counting the number of degrees of freedom in the singular value decomposition.
Interestingly, Candes and Tao showed that $C \mu_0 n_2 r \log(n_2)$ entries were necessary for completion when the
entries are sampled uniformly at random [7]. Hence, (1.2) is optimal up to a small numerical constant times $\log(n_2)$.
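
As a rough numerical illustration of these scales, the sketch below evaluates the right-hand side of (1.2) next to the parameter count $r(n_1 + n_2 - r)$ and the coupon collector scale $n_2 \log n_2$. The function names and the parameter values ($n_1 = n_2 = 10^6$, $r = 5$, $\mu_0 = \mu_1 = 1$, $\beta = 1.5$) are hypothetical choices made only for illustration.

```python
import math

def entries_sufficient(n1, n2, r, mu0, mu1, beta):
    """Right-hand side of (1.2): 32 max{mu1^2, mu0} r (n1 + n2) beta log^2(2 n2)."""
    return 32 * max(mu1 ** 2, mu0) * r * (n1 + n2) * beta * math.log(2 * n2) ** 2

def degrees_of_freedom(n1, n2, r):
    """Number of parameters of an n1 x n2 matrix of rank r: r (n1 + n2 - r)."""
    return r * (n1 + n2 - r)

def coupon_collector_scale(n2):
    """The scale n2 log(n2) needed to observe every column at least once."""
    return n2 * math.log(n2)

# Hypothetical, illustrative values (not taken from the paper).
n1 = n2 = 10 ** 6
r, mu0, mu1, beta = 5, 1.0, 1.0, 1.5

m = entries_sufficient(n1, n2, r, mu0, mu1, beta)
fraction = m / (n1 * n2)                        # share of the matrix observed
overhead = m / degrees_of_freedom(n1, n2, r)    # ratio to the parameter count
```

For these values the bound asks for roughly a tenth of the entries. Because of the constant 32 and the squared logarithm, the bound only drops below $n_1 n_2$ once the dimensions are large relative to $r$.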

Most importantly, the proof of Theorem 1.1 is short and straightforward. Candes and Recht employed sophisticated
tools from the study of random variables on Banach spaces including decoupling tools and powerful moment
inequalities for the norms of random matrices. Candes and Tao rely on intricate moment calculations spanning over 30 
pages. The present work only uses basic matrix analysis, elementary large deviation bounds, and a noncommutative 
version of Bernstein's Inequality proven here in the Appendix. 

The proof of Theorem 1.1 is inspired by a recent paper in quantum information which considered the problem
of reconstructing the density matrix of a quantum ensemble using as few measurements as possible [16]. Their work
adapted results from [4] and [5] to the quantum regime by using special algebraic properties of quantum measurements.
Their proof followed a methodology analogous to the approach of Candes and Recht but had two main differences:
they used a sampling with replacement model as a proxy for uniform sampling, and they deployed a powerful
noncommutative Chernoff bound developed by Ahlswede and Winter for use in quantum information theory [1]. In this
paper, I adapt these two strategies from [16] to the matrix completion problem. In Section 3, I show how the sampling
with replacement model bounds probabilities in the uniform sampling model, and present very short proofs of some
of the main results in [4]. Surprisingly, this yields a simple proof of Theorem 1.1, provided in Section 4, which has
the least restrictive assumptions of any assertion proven thus far.

2 Preliminaries and notation 

Before continuing, let us survey the notations used throughout the paper. I closely follow the conventions established 
in [4], and invite the reader to consult this reference for a more thorough discussion of the matrix completion problem 
and the associated convex geometry. A thorough introduction to the necessary matrix analysis used in this paper can 
be found in [24] . 

Matrices are bold capital, vectors are bold lowercase and scalars or entries are not bold. For example, $\mathbf{X}$ is a
matrix, and $X_{ij}$ its $(i, j)$th entry. Likewise $\mathbf{x}$ is a vector, and $x_i$ its $i$th component. If $u_k \in \mathbb{R}^n$ for $1 \le k \le d$ is
a collection of vectors, $[u_1, \ldots, u_d]$ will denote the $n \times d$ matrix whose $k$th column is $u_k$. $e_k$ will denote the $k$th
standard basis vector in $\mathbb{R}^d$, equal to 1 in component $k$ and 0 everywhere else. The dimension of $e_k$ will always be
clear from context. $X^*$ and $x^*$ denote the transpose of matrices $X$ and vectors $x$ respectively.

A variety of norms on matrices will be discussed. The spectral norm of a matrix is denoted by $\|X\|$. The Euclidean
inner product between two matrices is $\langle X, Y \rangle = \operatorname{Tr}(X^* Y)$, and the corresponding Euclidean norm, called the
Frobenius or Hilbert-Schmidt norm, is denoted $\|X\|_F$. That is, $\|X\|_F = \langle X, X \rangle^{1/2}$. The nuclear norm of a matrix
$X$ is $\|X\|_*$. The maximum entry of $X$ (in absolute value) is denoted by $\|X\|_\infty = \max_{ij} |X_{ij}|$. For vectors, the only
norm applied is the usual Euclidean $\ell_2$ norm, simply denoted as $\|x\|$.

Linear transformations that act on matrices will be denoted by calligraphic letters. In particular, the identity
operator will be denoted by $\mathcal{I}$. The spectral norm (the top singular value) of such an operator will be denoted by
$\|\mathcal{A}\| = \sup_{X : \|X\|_F \le 1} \|\mathcal{A}(X)\|_F$.

Fix once and for all a matrix $M$ obeying the assumptions of Theorem 1.1. Let $u_k$ (respectively $v_k$) denote the $k$th
column of $U$ (respectively $V$). Set $U = \operatorname{span}(u_1, \ldots, u_r)$, and $V = \operatorname{span}(v_1, \ldots, v_r)$. Also assume, without loss
of generality, that $n_1 \le n_2$. It is convenient to introduce the orthogonal decomposition $\mathbb{R}^{n_1 \times n_2} = T \oplus T^\perp$ where $T$
is the linear space spanned by elements of the form $u_k y^*$ and $x v_k^*$, $1 \le k \le r$, where $x$ and $y$ are arbitrary, and $T^\perp$
is its orthogonal complement. $T^\perp$ is the subspace of matrices spanned by the family $(x y^*)$, where $x$ (respectively $y$)
is any vector orthogonal to $U$ (respectively $V$).
The orthogonal projection $\mathcal{P}_T$ onto $T$ is given by

$$\mathcal{P}_T(Z) = P_U Z + Z P_V - P_U Z P_V, \tag{2.1}$$

where $P_U$ and $P_V$ are the orthogonal projections onto $U$ and $V$ respectively. Note here that while $P_U$ and $P_V$ are
matrices, $\mathcal{P}_T$ is a linear operator mapping matrices to matrices. The orthogonal projection onto $T^\perp$ is given by

$$\mathcal{P}_{T^\perp}(Z) = (\mathcal{I} - \mathcal{P}_T)(Z) = (I_{n_1} - P_U) Z (I_{n_2} - P_V)$$

where $I_d$ denotes the $d \times d$ identity matrix. It follows from the definition (2.1) of $\mathcal{P}_T$ that

$$\mathcal{P}_T(e_a e_b^*) = (P_U e_a) e_b^* + e_a (P_V e_b)^* - (P_U e_a)(P_V e_b)^*.$$

This gives

$$\|\mathcal{P}_T(e_a e_b^*)\|_F^2 = \langle \mathcal{P}_T(e_a e_b^*), e_a e_b^* \rangle = \|P_U e_a\|^2 + \|P_V e_b\|^2 - \|P_U e_a\|^2 \|P_V e_b\|^2.$$

Since $\|P_U e_a\|^2 \le \mu(U) r / n_1$ and $\|P_V e_b\|^2 \le \mu(V) r / n_2$,

$$\|\mathcal{P}_T(e_a e_b^*)\|_F^2 \le \max\{\mu(U), \mu(V)\}\, r\, \frac{n_1 + n_2}{n_1 n_2} \le \mu_0 r \frac{n_1 + n_2}{n_1 n_2}. \tag{2.2}$$

I will make frequent use of this calculation throughout the sequel. 
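
The calculation leading to (2.2) can be checked numerically on a small case. The sketch below is only an illustration under stated assumptions: the helper names and the rank-one subspaces are hypothetical, bases are assumed orthonormal, and $\mu_0$ is taken to be the larger of the two coherences.

```python
import math

def proj_norm_sq(basis, i):
    """||P e_i||^2 for the projection onto span(basis), with basis orthonormal."""
    return sum(u[i] ** 2 for u in basis)

def pt_basis_norm_sq(U, V, a, b):
    """||P_T(e_a e_b^*)||_F^2 = ||P_U e_a||^2 + ||P_V e_b||^2 - their product."""
    pu = proj_norm_sq(U, a)
    pv = proj_norm_sq(V, b)
    return pu + pv - pu * pv

def coherence(basis):
    """mu = (n / r) * max_i ||P e_i||^2 for an orthonormal basis."""
    r, n = len(basis), len(basis[0])
    return (n / r) * max(proj_norm_sq(basis, i) for i in range(n))

# Hypothetical rank-1 example: U is spanned by a flat vector in R^4,
# V by a vector in R^6 supported on two coordinates.
n1, n2, r = 4, 6, 1
U = [[0.5, 0.5, 0.5, 0.5]]
s = 1.0 / math.sqrt(2.0)
V = [[s, s, 0.0, 0.0, 0.0, 0.0]]

mu0 = max(coherence(U), coherence(V))
bound = mu0 * r * (n1 + n2) / (n1 * n2)        # right-hand side of (2.2)
worst = max(pt_basis_norm_sq(U, V, a, b)
            for a in range(n1) for b in range(n2))
```

Here the largest value of $\|\mathcal{P}_T(e_a e_b^*)\|_F^2$ over all index pairs is $0.625$, comfortably below the bound $1.25$ given by (2.2) with $\mu_0 = 3$.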

3 Sampling with Replacement 

As discussed above, the main contribution of this work is an analysis of uniformly sampled sets of entries via the study 
of a sampling with replacement model. All of the previous work [4, 7, 18] studied a Bernoulli sampling model as a
proxy for uniform sampling. There, each entry was revealed independently with probability equal to $p$. In all of these
results, the theorem statements concerned sampling sets of $m$ entries uniformly, but it was shown that the probability
of failure under Bernoulli sampling with $p = \frac{m}{n_1 n_2}$ closely approximated the probability of failure under uniform
sampling. The present work will analyze the situation where each entry index is sampled independently from the
uniform distribution on $\{1, \ldots, n_1\} \times \{1, \ldots, n_2\}$. This modification of the sampling model gives rise to all of the
simplifications below. 

It would appear that sampling with replacement is not suitable for analyzing matrix completion as one might 
encounter duplicate entries. However, just as is the case with Bernoulli sampling, bounding the likelihood of error 
when sampling with replacement allows us to bound the probability of the nuclear norm heuristic failing under uniform 
sampling. 

Proposition 3.1 The probability that the nuclear norm heuristic fails when the set of observed entries is sampled 
uniformly from the collection of sets of size m is less than or equal to the probability that the heuristic fails when m 
entries are sampled independently with replacement. 

Proof The proof follows the argument in Section II.C of [6]. Let $\Omega'$ be a collection of $m$ entries, each sampled
independently from the uniform distribution on $\{1, \ldots, n_1\} \times \{1, \ldots, n_2\}$. Let $\Omega_k$ denote a set of entries of size $k$
sampled uniformly from all collections of entries of size $k$. It follows that

$$\begin{aligned}
\mathbb{P}(\mathrm{Failure}(\Omega')) &= \sum_{k=0}^{m} \mathbb{P}(\mathrm{Failure}(\Omega') \mid |\Omega'| = k)\, \mathbb{P}(|\Omega'| = k) \\
&= \sum_{k=0}^{m} \mathbb{P}(\mathrm{Failure}(\Omega_k))\, \mathbb{P}(|\Omega'| = k) \\
&\ge \mathbb{P}(\mathrm{Failure}(\Omega_m)) \sum_{k=0}^{m} \mathbb{P}(|\Omega'| = k) = \mathbb{P}(\mathrm{Failure}(\Omega_m)),
\end{aligned}$$

where the inequality follows because $\mathbb{P}(\mathrm{Failure}(\Omega_m)) \ge \mathbb{P}(\mathrm{Failure}(\Omega_{m'}))$ if $m \le m'$. That is, the probability
decreases as the number of entries revealed is increased. ■
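
The two sampling models compared in this proof can be simulated in a few lines. The following sketch is an illustration only; the function names, the seed, and the sizes are hypothetical. It draws $m$ index pairs with replacement and a uniformly random set of $m$ distinct pairs, and counts the distinct entries each model reveals.

```python
import random

def sample_with_replacement(n1, n2, m, rng):
    """Draw m index pairs independently and uniformly (0-based indices)."""
    return [(rng.randrange(n1), rng.randrange(n2)) for _ in range(m)]

def sample_without_replacement(n1, n2, m, rng):
    """Draw a uniformly random set of m distinct index pairs."""
    flat = rng.sample(range(n1 * n2), m)
    return [divmod(k, n2) for k in flat]

rng = random.Random(0)                 # fixed seed, chosen arbitrarily
n1, n2, m = 20, 30, 200

omega_prime = sample_with_replacement(n1, n2, m, rng)
omega_m = sample_without_replacement(n1, n2, m, rng)

distinct_prime = len(set(omega_prime))   # |Omega'| <= m because of duplicates
distinct_m = len(set(omega_m))           # exactly m distinct entries
```

Sampling with replacement never reveals more than $m$ distinct entries, which is why a failure bound in that model carries over to uniform sampling of sets of size $m$.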

Surprisingly, changing the sampling model makes most of the theorems from [4] simple consequences of a
noncommutative variant of Bernstein's Inequality.

Theorem 3.2 (Noncommutative Bernstein Inequality) Let $X_1, \ldots, X_L$ be independent zero-mean random matrices
of dimension $d_1 \times d_2$. Suppose $\rho_k^2 = \max\{\|\mathbb{E}[X_k X_k^*]\|, \|\mathbb{E}[X_k^* X_k]\|\}$ and $\|X_k\| \le M$ almost surely for all $k$.
Then for any $\tau > 0$,

$$\mathbb{P}\left[ \left\| \sum_{k=1}^{L} X_k \right\| > \tau \right] \le (d_1 + d_2) \exp\left( \frac{-\tau^2 / 2}{\sum_{k=1}^{L} \rho_k^2 + M \tau / 3} \right).$$

Note that in the case that $d_1 = d_2 = 1$, this is precisely the two sided version of the standard Bernstein Inequality.
When the $X_k$ are diagonal, this bound is the same as applying the standard Bernstein Inequality and a
union bound to the diagonal of the matrix summation. Furthermore, observe that the right hand side is less than
$(d_1 + d_2) \exp\left( -\frac{3}{8} \tau^2 / \sum_{k=1}^{L} \rho_k^2 \right)$ as long as $\tau \le \frac{1}{M} \sum_{k=1}^{L} \rho_k^2$. This condensed form of the inequality will be used
exclusively throughout. Theorem 3.2 is a corollary of a Chernoff bound for finite dimensional operators developed
by Ahlswede and Winter [1]. A similar inequality for symmetric i.i.d. matrices is proposed in [16]. The proof is
provided in the Appendix.
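
The relationship between the full tail bound and its condensed form can be checked numerically. The sketch below is a minimal illustration; the function names and the numbers plugged in are hypothetical. It evaluates both right-hand sides and enforces the stated condition $\tau \le \frac{1}{M} \sum_k \rho_k^2$ for the condensed form.

```python
import math

def bernstein_tail(tau, rho_sq, M, d1, d2):
    """Right-hand side of the Noncommutative Bernstein Inequality.

    rho_sq is the list of rho_k^2 and M bounds ||X_k|| almost surely.
    """
    total = sum(rho_sq)
    return (d1 + d2) * math.exp(-(tau ** 2 / 2.0) / (total + M * tau / 3.0))

def bernstein_tail_condensed(tau, rho_sq, M, d1, d2):
    """Condensed form, valid only when tau <= sum(rho_k^2) / M."""
    total = sum(rho_sq)
    if tau > total / M:
        raise ValueError("condensed form requires tau <= sum(rho_k^2) / M")
    return (d1 + d2) * math.exp(-3.0 * tau ** 2 / (8.0 * total))

# Hypothetical numbers: 1000 summands with rho_k^2 = 0.01 and ||X_k|| <= 1.
rho_sq = [0.01] * 1000
full = bernstein_tail(5.0, rho_sq, 1.0, 50, 50)
condensed = bernstein_tail_condensed(5.0, rho_sq, 1.0, 50, 50)
```

In the admissible range the denominator $\sum_k \rho_k^2 + M\tau/3$ is at most $\frac{4}{3} \sum_k \rho_k^2$, which is where the constant $\frac{3}{8}$ comes from; the condensed value is therefore never smaller than the full one.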

Let us now record two theorems, proven for the Bernoulli model in [4], that admit very simple proofs in the
sampling with replacement model. The theorem statements require some additional notation. Let $\Omega = \{(a_k, b_k)\}_{k=1}^{l}$
be a collection of indices sampled uniformly with replacement. Set $\mathcal{R}_\Omega$ to be the operator

$$\mathcal{R}_\Omega(Z) = \sum_{k=1}^{|\Omega|} \langle e_{a_k} e_{b_k}^*, Z \rangle\, e_{a_k} e_{b_k}^*.$$

Note that the $(i, j)$th component of $\mathcal{R}_\Omega(X)$ is zero unless $(i, j) \in \Omega$. For $(i, j) \in \Omega$, the $(i, j)$th component of $\mathcal{R}_\Omega(X)$ is equal to $X_{ij}$ times
the multiplicity of $(i, j) \in \Omega$. Unlike in previous work on matrix completion, $\mathcal{R}_\Omega$ is not a projection operator if there
are duplicates in $\Omega$. Nonetheless, this does not adversely affect the argument, and $\mathcal{R}_\Omega(X) = 0$ if and only if $X_{ab} = 0$
for all $(a, b) \in \Omega$. Moreover, we can show that the maximum duplication of any entry is always less than $\frac{8}{3} \log(n_2)$
with very high probability.
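
A direct implementation makes the role of duplicates visible. The sketch below is illustrative; the function name `sampling_operator` and the small example are assumptions, and indices are 0-based. It applies the sampling operator to a matrix stored as a list of rows.

```python
from collections import Counter

def sampling_operator(omega, Z):
    """Apply R_Omega(Z) = sum_k <e_{a_k} e_{b_k}^*, Z> e_{a_k} e_{b_k}^*.

    omega is a list of (a, b) index pairs, possibly with repeats, and Z is a
    list of rows. Entry (a, b) of the result is Z[a][b] times the number of
    times (a, b) occurs in omega; all other entries are zero.
    """
    counts = Counter(omega)
    out = [[0.0] * len(Z[0]) for _ in Z]
    for (a, b), multiplicity in counts.items():
        out[a][b] = multiplicity * Z[a][b]
    return out

Z = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
omega = [(0, 1), (1, 2), (0, 1)]           # entry (0, 1) is sampled twice

once = sampling_operator(omega, Z)         # [[0, 4, 0], [0, 0, 6]]
twice = sampling_operator(omega, once)     # [[0, 8, 0], [0, 0, 6]]
```

Applying the operator twice doubles the duplicated entry again, so the operator is not idempotent and hence not a projection, while its kernel is still exactly the set of matrices vanishing on the sampled indices.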

Proposition 3.3 With probability at least $1 - n_2^{2 - 2\beta}$, the maximum number of repetitions of any entry in $\Omega$ is less than
$\frac{8}{3} \beta \log(n_2)$ for $n_2 \ge 9$ and $\beta > 1$.

Proof This assertion can be proven by applying a standard Chernoff bound for the Bernoulli distribution. Note that for
a fixed entry, the probability it is sampled more than $t$ times is equal to the probability of more than $t$ heads occurring
in a sequence of $m$ tosses where the probability of a head is $\frac{1}{n_1 n_2}$. This probability can be upper bounded by

$$\mathbb{P}[\text{more than } t \text{ heads in } m \text{ trials}] \le \left( \frac{m}{n_1 n_2 t} \right)^{t} \exp\left( t - \frac{m}{n_1 n_2} \right)$$

(see [17], for example). Applying the union bound over all of the $n_1 n_2$ entries and the fact that $\frac{m}{n_1 n_2} < 1$, we have

$$\mathbb{P}\left[ \text{any entry is selected more than } \tfrac{8}{3} \beta \log(n_2) \text{ times} \right] \le n_1 n_2 \left( \tfrac{8}{3} \beta \log(n_2) \right)^{-\frac{8}{3} \beta \log(n_2)} \exp\left( \tfrac{8}{3} \beta \log(n_2) \right) \le n_2^{2 - 2\beta}$$

when $n_2 \ge 9$. ■

This application of the Chernoff bound is very crude, and much tighter bounds can be derived using more careful
analysis. For example in [15], the maximum oversampling is shown to be bounded by $O\left( \frac{\log(n_2)}{\log\log(n_2)} \right)$. For our purposes
here, the loose upper bound provided by Proposition 3.3 will be more than sufficient.
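
The looseness of the duplication bound, and the role of the condition $n_2 \ge 9$, can be seen numerically. The sketch below is an illustration under stated assumptions: the function names, seed, and sizes are hypothetical. It evaluates the union-bound expression $n_1 n_2\, t^{-t} e^{t}$ with $t = \frac{8}{3} \beta \log(n_2)$ and compares the threshold against the largest multiplicity seen in one simulated draw.

```python
import math
import random
from collections import Counter

def repetition_threshold(n2, beta):
    """The threshold (8/3) * beta * log(n2) on repetitions of a single entry."""
    return (8.0 / 3.0) * beta * math.log(n2)

def union_chernoff_bound(n1, n2, beta):
    """n1 n2 t^(-t) e^t with t = (8/3) beta log(n2), as in the union bound."""
    t = repetition_threshold(n2, beta)
    return n1 * n2 * math.exp(t - t * math.log(t))

def max_multiplicity(n1, n2, m, rng):
    """Largest number of times any single entry is drawn in m uniform draws."""
    draws = Counter((rng.randrange(n1), rng.randrange(n2)) for _ in range(m))
    return max(draws.values())

at_nine = union_chernoff_bound(9, 9, 1.0)     # below n2^(2 - 2 beta) = 1
at_eight = union_chernoff_bound(8, 8, 1.0)    # above 1, so n2 >= 9 is needed

rng = random.Random(1)                        # fixed seed, chosen arbitrarily
observed = max_multiplicity(50, 50, 2000, rng)    # m < n1 * n2
threshold = repetition_threshold(50, 1.5)
```

In the boundary case $\beta = 1$ the expression crosses $1$ between $n_2 = 8$ and $n_2 = 9$. In the simulation the observed maximum multiplicity sits far below the threshold of about $15.6$, consistent with the remark that the bound is crude.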

In addition to this bound on the norm of $\mathcal{R}_\Omega$, the following theorem asserts that the operator $\mathcal{P}_T \mathcal{R}_\Omega \mathcal{P}_T$ is also
very close to an isometry on $T$ if the number of sampled entries is sufficiently large. This result is analogous to
Theorem 4.1 in [4] for the Bernoulli model, whose proof uses several powerful theorems from the study of probability
in Banach spaces. Here, one only needs to compute a few low order moments and then apply Theorem 3.2.
Theorem 3.4 Suppose $\Omega$ is a set of entries of size $m$ sampled independently and uniformly with replacement. Then
for all $\beta > 1$,

$$\frac{n_1 n_2}{m} \left\| \mathcal{P}_T \mathcal{R}_\Omega \mathcal{P}_T - \frac{m}{n_1 n_2} \mathcal{P}_T \right\| \le \sqrt{\frac{16 \mu_0 r (n_1 + n_2) \beta \log(n_2)}{3m}}$$

with probability at least $1 - 2 n_2^{2 - 2\beta}$ provided that $m > \frac{16}{3} \mu_0 r (n_1 + n_2) \beta \log(n_2)$.
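Proof Decompose any matrix $Z$ as $Z = \sum_{a,b} \langle Z, e_a e_b^* \rangle e_a e_b^*$ so that

$$\mathcal{P}_T(Z) = \sum_{a,b} \langle \mathcal{P}_T(Z), e_a e_b^* \rangle e_a e_b^* = \sum_{a,b} \langle Z, \mathcal{P}_T(e_a e_b^*) \rangle e_a e_b^*. \tag{3.1}$$

For $k = 1, \ldots, m$ sample $(a_k, b_k)$ from $\{1, \ldots, n_1\} \times \{1, \ldots, n_2\}$ uniformly with replacement. Then $\mathcal{R}_\Omega \mathcal{P}_T(Z) =
\sum_{k=1}^{m} \langle Z, \mathcal{P}_T(e_{a_k} e_{b_k}^*) \rangle e_{a_k} e_{b_k}^*$ which gives

$$(\mathcal{P}_T \mathcal{R}_\Omega \mathcal{P}_T)(Z) = \sum_{k=1}^{m} \langle Z, \mathcal{P}_T(e_{a_k} e_{b_k}^*) \rangle\, \mathcal{P}_T(e_{a_k} e_{b_k}^*).$$

Now the fact that the operator $\mathcal{P}_T \mathcal{R}_\Omega \mathcal{P}_T$ does not deviate from its expected value

$$\mathbb{E}(\mathcal{P}_T \mathcal{R}_\Omega \mathcal{P}_T) = \mathcal{P}_T (\mathbb{E} \mathcal{R}_\Omega) \mathcal{P}_T = \mathcal{P}_T \left( \frac{m}{n_1 n_2} \mathcal{I} \right) \mathcal{P}_T = \frac{m}{n_1 n_2} \mathcal{P}_T$$

in the spectral norm can be proven using the Noncommutative Bernstein Inequality.

To proceed, define the operator $\mathcal{T}_{ab}$ which maps $Z$ to $\langle \mathcal{P}_T(e_a e_b^*), Z \rangle \mathcal{P}_T(e_a e_b^*)$. This operator is rank one, has
operator norm $\|\mathcal{T}_{ab}\| = \|\mathcal{P}_T(e_a e_b^*)\|_F^2$, and we have $\mathcal{P}_T = \sum_{a,b} \mathcal{T}_{ab}$ by (3.1). Hence, for $k = 1, \ldots, m$, $\mathbb{E}[\mathcal{T}_{a_k b_k}] =
\frac{1}{n_1 n_2} \mathcal{P}_T$.

Observe that if $A$ and $B$ are positive semidefinite, we have $\|A - B\| \le \max\{\|A\|, \|B\|\}$. Using this fact, we
can compute the bound

$$\left\| \mathcal{T}_{a_k b_k} - \frac{1}{n_1 n_2} \mathcal{P}_T \right\| \le \max\left\{ \|\mathcal{P}_T(e_{a_k} e_{b_k}^*)\|_F^2, \frac{1}{n_1 n_2} \right\} \le \mu_0 r \frac{n_1 + n_2}{n_1 n_2},$$

where the final inequality follows from (2.2). We also have

$$\begin{aligned}
\left\| \mathbb{E}\left[ \left( \mathcal{T}_{a_k b_k} - \frac{1}{n_1 n_2} \mathcal{P}_T \right)^2 \right] \right\| &= \left\| \mathbb{E}\left[ \|\mathcal{P}_T(e_{a_k} e_{b_k}^*)\|_F^2\, \mathcal{T}_{a_k b_k} \right] - \frac{1}{n_1^2 n_2^2} \mathcal{P}_T \right\| \\
&\le \max\left\{ \left\| \mathbb{E}\left[ \|\mathcal{P}_T(e_{a_k} e_{b_k}^*)\|_F^2\, \mathcal{T}_{a_k b_k} \right] \right\|, \frac{1}{n_1^2 n_2^2} \right\} \\
&\le \max\left\{ \left\| \mathbb{E}[\mathcal{T}_{a_k b_k}] \right\| \mu_0 r \frac{n_1 + n_2}{n_1 n_2}, \frac{1}{n_1^2 n_2^2} \right\} \le \mu_0 r \frac{n_1 + n_2}{n_1^2 n_2^2}.
\end{aligned}$$

The theorem now follows by applying the Noncommutative Bernstein Inequality. ■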

The next theorem is an analog of Theorem 6.3 in [4] or Lemma 3.2 in [18]. This theorem asserts that for a fixed
matrix, if one sets all of the entries not in $\Omega$ to zero it remains close to a multiple of the original matrix in the operator
norm.

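Theorem 3.5 Suppose $\Omega$ is a set of entries of size $m$ sampled independently and uniformly with replacement and let
$Z$ be a fixed $n_1 \times n_2$ matrix. Assume without loss of generality that $n_1 \le n_2$. Then for all $\beta > 1$,

$$\left\| \left( \frac{n_1 n_2}{m} \mathcal{R}_\Omega - \mathcal{I} \right)(Z) \right\| \le \sqrt{\frac{8 \beta n_1 n_2^2 \log(n_1 + n_2)}{3m}}\, \|Z\|_\infty$$

with probability at least $1 - (n_1 + n_2)^{1 - \beta}$ provided that $m > 6 \beta n_1 \log(n_1 + n_2)$.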
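Proof First observe that the operator norm can be upper bounded by a multiple of the matrix infinity norm

$$\|Z\| = \sup_{\|x\| = 1, \|y\| = 1} \sum_{a,b} Z_{ab} y_a x_b \le \left( \sum_{a,b} Z_{ab}^2 y_a^2 \right)^{1/2} \left( \sum_{a,b} x_b^2 \right)^{1/2} \le \sqrt{n_1} \max_a \left( \sum_b Z_{ab}^2 \right)^{1/2} \le \sqrt{n_1 n_2}\, \|Z\|_\infty.$$

Note that $\frac{n_1 n_2}{m} \mathcal{R}_\Omega(Z) - Z = \frac{1}{m} \sum_{k=1}^{m} \left( n_1 n_2 Z_{a_k b_k} e_{a_k} e_{b_k}^* - Z \right)$. This is a sum of zero-mean random matrices, and

$$\left\| n_1 n_2 Z_{a_k b_k} e_{a_k} e_{b_k}^* - Z \right\| \le \left\| n_1 n_2 Z_{a_k b_k} e_{a_k} e_{b_k}^* \right\| + \|Z\| < \tfrac{3}{2} n_1 n_2 \|Z\|_\infty$$

for $n_1 \ge 2$. We also have

$$\begin{aligned}
\left\| \mathbb{E}\left[ (n_1 n_2 Z_{a_k b_k} e_{a_k} e_{b_k}^* - Z)^* (n_1 n_2 Z_{a_k b_k} e_{a_k} e_{b_k}^* - Z) \right] \right\| &= \left\| n_1 n_2 \sum_{c,d} Z_{cd}^2 e_d e_d^* - Z^* Z \right\| \\
&\le \max\left\{ \left\| n_1 n_2 \sum_{c,d} Z_{cd}^2 e_d e_d^* \right\|, \|Z^* Z\| \right\} \le n_1 n_2^2 \|Z\|_\infty^2
\end{aligned}$$

where we again use the fact that $\|A - B\| \le \max\{\|A\|, \|B\|\}$ for positive semidefinite $A$ and $B$. A similar calculation
holds for $(n_1 n_2 Z_{a_k b_k} e_{a_k} e_{b_k}^* - Z)(n_1 n_2 Z_{a_k b_k} e_{a_k} e_{b_k}^* - Z)^*$. The theorem now follows by the Noncommutative
Bernstein Inequality. ■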

Finally, the following lemma is required to prove Theorem 1.1. Succinctly, it says that for a fixed matrix in $T$, the
operator $\mathcal{P}_T \mathcal{R}_\Omega$ does not increase the matrix infinity norm.

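Lemma 3.6 Suppose $\Omega$ is a set of entries of size $m$ sampled independently and uniformly with replacement and let
$Z \in T$ be a fixed $n_1 \times n_2$ matrix. Assume without loss of generality that $n_1 \le n_2$. Then for all $\beta > 2$,

$$\left\| \frac{n_1 n_2}{m} \mathcal{P}_T \mathcal{R}_\Omega(Z) - Z \right\|_\infty \le \sqrt{\frac{8 \beta \mu_0 r (n_1 + n_2) \log n_2}{3m}}\, \|Z\|_\infty$$

with probability at least $1 - 2 n_2^{2 - \beta}$ provided that $m > \frac{8}{3} \beta \mu_0 r (n_1 + n_2) \log n_2$.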

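Proof This lemma can be proven using the standard Bernstein Inequality. For each matrix index $(c, d)$, sample $(a, b)$
uniformly at random to define the random variable $\xi_{cd} = \langle e_c e_d^*, n_1 n_2 \langle e_a e_b^*, Z \rangle \mathcal{P}_T(e_a e_b^*) - Z \rangle$. We have $\mathbb{E}[\xi_{cd}] = 0$,
$|\xi_{cd}| \le \mu_0 r (n_1 + n_2) \|Z\|_\infty$, and

$$\begin{aligned}
\mathbb{E}[\xi_{cd}^2] &= \frac{1}{n_1 n_2} \sum_{a,b} \langle e_c e_d^*, n_1 n_2 \langle e_a e_b^*, Z \rangle \mathcal{P}_T(e_a e_b^*) - Z \rangle^2 \\
&= n_1 n_2 \sum_{a,b} \langle \mathcal{P}_T(e_c e_d^*), e_a e_b^* \rangle^2 \langle e_a e_b^*, Z \rangle^2 - Z_{cd}^2 \\
&\le n_1 n_2 \|\mathcal{P}_T(e_c e_d^*)\|_F^2 \|Z\|_\infty^2 \le \mu_0 r (n_1 + n_2) \|Z\|_\infty^2.
\end{aligned}$$

Since the $(c, d)$ entry of $\frac{n_1 n_2}{m} \mathcal{P}_T \mathcal{R}_\Omega(Z) - Z$ is identically distributed to $\frac{1}{m} \sum_{k=1}^{m} \xi_{cd}^{(k)}$, where $\xi_{cd}^{(k)}$ are i.i.d. copies of
$\xi_{cd}$, we have by Bernstein's Inequality and the union bound:

$$\Pr\left[ \left\| \frac{n_1 n_2}{m} \mathcal{P}_T \mathcal{R}_\Omega(Z) - Z \right\|_\infty > \sqrt{\frac{8 \beta \mu_0 r (n_1 + n_2) \log(n_2)}{3m}}\, \|Z\|_\infty \right] \le 2 n_1 n_2 \exp(-\beta \log(n_2)) \le 2 n_2^{2 - \beta}.$$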
4 Proof of Theorem 1.1



The proof follows the program developed in [16] which itself adapted the strategy proposed in [4]. The main idea is
to approximate a dual feasible solution of (1.3) which certifies that $M$ is the unique minimum nuclear norm solution.
In [4] such a certificate was constructed via an infinite series using a construction developed in the compressed sensing
literature [6, 13]. The terms in this series were then analyzed individually using the decoupling inequalities of de la
Pena and Montgomery-Smith [10]. Truncating the infinite series after 4 terms gave their result. In [7], the authors
bounded the contribution of $O(\log(n_2))$ terms in this series using intensive combinatorial analysis of each term. The
insight in [16] was that, when sampling observations with replacement, a dual feasible solution could be closely
approximated by a modified series where each term involved the product of independent random variables. This
change in the sampling model allows one to avoid decoupling inequalities and gives rise to the dramatic simplification
here.

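To proceed, recall again that by Proposition 3.1 it suffices to consider the scenario when the entries are sampled
independently and uniformly with replacement. I will first develop the main argument of the proof assuming many
conditions hold with high probability. The proof is completed by subsequently bounding the probability that all of these
events hold. Suppose that

$$\frac{n_1 n_2}{m} \left\| \mathcal{P}_T \mathcal{R}_\Omega \mathcal{P}_T - \frac{m}{n_1 n_2} \mathcal{P}_T \right\| \le \frac{1}{2}, \qquad \|\mathcal{R}_\Omega\| \le \frac{8}{3} \beta^{1/2} \log(n_2). \tag{4.1}$$

Also suppose there exists a $Y$ in the range of $\mathcal{R}_\Omega$ such that

$$\|\mathcal{P}_T(Y) - UV^*\|_F \le \sqrt{\frac{r}{2 n_2}}, \qquad \|\mathcal{P}_{T^\perp}(Y)\| < \frac{1}{2}. \tag{4.2}$$

If (4.1) holds, then for any $Z \in \ker \mathcal{R}_\Omega$, $\mathcal{P}_T(Z)$ cannot be too large. Indeed, we have

$$0 = \|\mathcal{R}_\Omega(Z)\|_F \ge \|\mathcal{R}_\Omega \mathcal{P}_T(Z)\|_F - \|\mathcal{R}_\Omega \mathcal{P}_{T^\perp}(Z)\|_F.$$

Now observe that

$$\|\mathcal{R}_\Omega \mathcal{P}_T(Z)\|_F^2 = \langle Z, \mathcal{P}_T \mathcal{R}_\Omega^2 \mathcal{P}_T(Z) \rangle \ge \langle Z, \mathcal{P}_T \mathcal{R}_\Omega \mathcal{P}_T(Z) \rangle \ge \frac{m}{2 n_1 n_2} \|\mathcal{P}_T(Z)\|_F^2$$

and $\|\mathcal{R}_\Omega \mathcal{P}_{T^\perp}(Z)\|_F \le \frac{8}{3} \beta^{1/2} \log(n_2) \|\mathcal{P}_{T^\perp}(Z)\|_F$. Collecting these facts gives that for any $Z \in \ker \mathcal{R}_\Omega$,

$$\|\mathcal{P}_{T^\perp}(Z)\|_F \ge \sqrt{\frac{9m}{128 \beta n_1 n_2 \log^2(n_2)}}\, \|\mathcal{P}_T(Z)\|_F > \sqrt{\frac{2r}{n_2}}\, \|\mathcal{P}_T(Z)\|_F.$$

Now recall that $\|A\|_* = \sup_{\|B\| \le 1} \langle A, B \rangle$. For $Z \in \ker \mathcal{R}_\Omega$, pick $U_\perp$ and $V_\perp$ such that $[U, U_\perp]$ and $[V, V_\perp]$ are
unitary matrices and that $\langle U_\perp V_\perp^*, \mathcal{P}_{T^\perp}(Z) \rangle = \|\mathcal{P}_{T^\perp}(Z)\|_*$. Then it follows that

$$\begin{aligned}
\|M + Z\|_* &\ge \langle UV^* + U_\perp V_\perp^*, M + Z \rangle \\
&= \|M\|_* + \langle UV^* + U_\perp V_\perp^*, Z \rangle \\
&= \|M\|_* + \langle UV^* - \mathcal{P}_T(Y), \mathcal{P}_T(Z) \rangle + \langle U_\perp V_\perp^* - \mathcal{P}_{T^\perp}(Y), \mathcal{P}_{T^\perp}(Z) \rangle \\
&> \|M\|_* - \sqrt{\frac{r}{2 n_2}}\, \|\mathcal{P}_T(Z)\|_F + \frac{1}{2} \|\mathcal{P}_{T^\perp}(Z)\|_* \ge \|M\|_*.
\end{aligned}$$

The first inequality holds from the variational characterization of the nuclear norm. We also used the fact that $\langle Y, Z \rangle = 0$
for all $Z \in \ker \mathcal{R}_\Omega$. Thus, if a $Y$ exists obeying (4.2), we have that for any $X \ne M$ obeying $\mathcal{R}_\Omega(X - M) = 0$,
$\|X\|_* > \|M\|_*$. That is, if any such $X$ has $M_{ab} = X_{ab}$ for all $(a, b) \in \Omega$, $X$ has strictly larger nuclear norm than $M$,
and hence $M$ is the unique minimizer of (1.3). The remainder of the proof shows that such a $Y$ exists with high
probability.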

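To this end, partition $1, \ldots, m$ into $p$ partitions of size $q$. By assumption, we may choose

$$q \ge \frac{128}{3} \max\{\mu_0, \mu_1^2\}\, r (n_1 + n_2) \beta \log(n_1 + n_2) \quad \text{and} \quad p \ge \frac{3}{4} \log(2 n_2).$$

Let $\Omega_j$ denote the set of indices corresponding to the $j$th partition. Note that each of these partitions are independent
of one another when the indices are sampled with replacement. Assume that

$$\frac{n_1 n_2}{q} \left\| \mathcal{P}_T \mathcal{R}_{\Omega_k} \mathcal{P}_T - \frac{q}{n_1 n_2} \mathcal{P}_T \right\| \le \frac{1}{2} \tag{4.3}$$

for all $k$. Define $W_0 = UV^*$ and set $Y_k = \frac{n_1 n_2}{q} \sum_{j=1}^{k} \mathcal{R}_{\Omega_j}(W_{j-1})$, $W_k = UV^* - \mathcal{P}_T(Y_k)$ for $k = 1, \ldots, p$. Then

$$\|W_k\|_F = \left\| W_{k-1} - \frac{n_1 n_2}{q} \mathcal{P}_T \mathcal{R}_{\Omega_k}(W_{k-1}) \right\|_F = \left\| \left( \mathcal{P}_T - \frac{n_1 n_2}{q} \mathcal{P}_T \mathcal{R}_{\Omega_k} \mathcal{P}_T \right)(W_{k-1}) \right\|_F \le \frac{1}{2} \|W_{k-1}\|_F$$

and it follows that $\|W_k\|_F \le 2^{-k} \|W_0\|_F = 2^{-k} \sqrt{r}$. Since $p \ge \frac{3}{4} \log(2 n_2) \ge \frac{1}{2} \log_2(2 n_2) = \log_2 \sqrt{2 n_2}$, then
$Y = Y_p$ will satisfy the first inequality of (4.2). Also suppose that

$$\left\| W_{k-1} - \frac{n_1 n_2}{q} \mathcal{P}_T \mathcal{R}_{\Omega_k}(W_{k-1}) \right\|_\infty \le \frac{1}{2} \|W_{k-1}\|_\infty \tag{4.4}$$

$$\left\| \left( \frac{n_1 n_2}{q} \mathcal{R}_{\Omega_j} - \mathcal{I} \right)(W_{j-1}) \right\| \le \sqrt{\frac{8 n_1 n_2^2 \beta \log n_2}{3q}}\, \|W_{j-1}\|_\infty \tag{4.5}$$

for $j, k = 1, \ldots, p$.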
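
The iteration that defines $Y_k$ and $W_k$ can be written out directly. The sketch below is a minimal pure-Python illustration, not the author's implementation: the helper names, the dense list-of-rows representation, and the tiny rank-one example are all assumptions, and $U$ and $V$ are assumed to have orthonormal columns.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def combine(A, B, sign):
    """Entrywise A + sign * B."""
    return [[x + sign * y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def project_T(Z, PU, PV):
    """P_T(Z) = P_U Z + Z P_V - P_U Z P_V, as in (2.1)."""
    PUZ = matmul(PU, Z)
    ZPV = matmul(Z, PV)
    return combine(combine(PUZ, ZPV, 1.0), matmul(PUZ, PV), -1.0)

def scaled_sample(Z, omega_j, scale):
    """scale * R_{Omega_j}(Z); repeated indices count with multiplicity."""
    out = [[0.0] * len(Z[0]) for _ in Z]
    for a, b in omega_j:
        out[a][b] += scale * Z[a][b]
    return out

def frobenius(A):
    return sum(x * x for row in A for x in row) ** 0.5

def iterate_certificate(U, V, partitions):
    """Return Y_p and the norms ||W_0||_F, ..., ||W_p||_F.

    U (n1 x r) and V (n2 x r) are lists of rows with orthonormal columns;
    partitions is a list of p index lists Omega_1, ..., Omega_p.
    """
    n1, n2 = len(U), len(V)
    PU, PV = matmul(U, transpose(U)), matmul(V, transpose(V))
    W0 = matmul(U, transpose(V))                      # W_0 = U V^*
    Y = [[0.0] * n2 for _ in range(n1)]
    W, norms = W0, [frobenius(W0)]
    for omega_j in partitions:
        scale = n1 * n2 / len(omega_j)                # n1 n2 / q
        Y = combine(Y, scaled_sample(W, omega_j, scale), 1.0)
        W = combine(W0, project_T(Y, PU, PV), -1.0)   # W_k = U V^* - P_T(Y_k)
        norms.append(frobenius(W))
    return Y, norms

# Sanity check on a hypothetical rank-1 example: if a block contains every
# entry exactly once, then (n1 n2 / q) R_Omega is the identity and W_1 = 0.
U = [[0.6], [0.8]]
V = [[2.0 / 3.0], [2.0 / 3.0], [1.0 / 3.0]]
full_block = [(a, b) for a in range(2) for b in range(3)]
Y, norms = iterate_certificate(U, V, [full_block])
```

With randomly sampled blocks the residual norms are not guaranteed to shrink in any single run; the halving $\|W_k\|_F \le \frac{1}{2} \|W_{k-1}\|_F$ is exactly what condition (4.3) provides. The degenerate full-observation block is used here only because its outcome is known in closed form.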

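To see that $\|\mathcal{P}_{T^\perp}(Y_p)\| < \frac{1}{2}$ when (4.4) and (4.5) hold, observe $\|W_k\|_\infty \le 2^{-k} \|UV^*\|_\infty$, and it follows that

$$\begin{aligned}
\|\mathcal{P}_{T^\perp}(Y_p)\| &\le \sum_{j=1}^{p} \left\| \frac{n_1 n_2}{q} \mathcal{P}_{T^\perp} \mathcal{R}_{\Omega_j}(W_{j-1}) \right\| \\
&= \sum_{j=1}^{p} \left\| \mathcal{P}_{T^\perp}\left( \frac{n_1 n_2}{q} \mathcal{R}_{\Omega_j}(W_{j-1}) - W_{j-1} \right) \right\| \\
&\le \sum_{j=1}^{p} \left\| \left( \frac{n_1 n_2}{q} \mathcal{R}_{\Omega_j} - \mathcal{I} \right)(W_{j-1}) \right\| \\
&\le \sum_{j=1}^{p} \sqrt{\frac{8 n_1 n_2^2 \beta \log n_2}{3q}}\, \|W_{j-1}\|_\infty \\
&\le 2 \sum_{j=1}^{p} 2^{-j} \sqrt{\frac{8 n_1 n_2^2 \beta \log n_2}{3q}}\, \|UV^*\|_\infty \\
&< \sqrt{\frac{32 \mu_1^2 r n_2 \beta \log n_2}{3q}} < 1/2
\end{aligned}$$

since $q > \frac{128}{3} \mu_1^2 r n_2 \beta \log(n_2)$. The first inequality follows from the triangle inequality. The second line follows
because $W_{j-1} \in T$ for all $j$. The third line follows because, for any $Z$,

$$\|\mathcal{P}_{T^\perp}(Z)\| = \|(I_{n_1} - P_U) Z (I_{n_2} - P_V)\| \le \|Z\|.$$

The fourth line applies (4.5). The next line follows from (4.4). The final line follows from the assumption A1.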

All that remains is to bound the probability that all of the invoked events hold. With $m$ satisfying the bound in
the main theorem statement, the first inequality in (4.1) fails to hold with probability at most $2 n_2^{2 - 2\beta}$ by Theorem 3.4,
and the second inequality fails to hold with probability at most $n_2^{2 - 2\beta^{1/2}}$ by Proposition 3.3. For all $k$, (4.3) fails to
hold with probability at most $2 n_2^{2 - 2\beta}$, (4.4) fails to hold with probability at most $2 n_2^{2 - 2\beta}$, and (4.5) fails to hold with
probability at most $(n_1 + n_2)^{1 - 2\beta}$. Summing these all together, all of the events hold with probability at least

$$1 - 6 \log(n_2)(n_1 + n_2)^{2 - 2\beta} - n_2^{2 - 2\beta^{1/2}}$$

by the union bound. This completes the proof.
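
To get a feel for the final probability bound, the sketch below evaluates it for two values of $\beta$. The function name and the chosen dimensions are hypothetical and serve only as an illustration.

```python
import math

def success_probability_lower_bound(n1, n2, beta):
    """1 - 6 log(n2) (n1 + n2)^(2 - 2 beta) - n2^(2 - 2 sqrt(beta))."""
    return (1.0
            - 6.0 * math.log(n2) * (n1 + n2) ** (2.0 - 2.0 * beta)
            - n2 ** (2.0 - 2.0 * math.sqrt(beta)))

# Hypothetical dimensions n1 = n2 = 1000.
p_small_beta = success_probability_lower_bound(1000, 1000, 1.1)   # negative: vacuous
p_large_beta = success_probability_lower_bound(1000, 1000, 2.0)   # close to 1
```

For $\beta$ only slightly above 1 the expression is negative and the guarantee says nothing, while $\beta = 2$ already gives a bound above $0.99$ at these dimensions. Since $\beta$ also multiplies the required number of entries in (1.2), it trades sample size against confidence.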
5 Discussion and Conclusions



The results proven here are nearly optimal, but small improvements can possibly be made. The numerical constant 
32 in the statement of the theorem may be reducible by more clever bookkeeping, and it may be possible to derive 
a linear dependence on the logarithm of the matrix dimensions. But further reduction is not possible because of the 
necessary conditions provided by Candes and Tao. One minor improvement that could be made would be to remove
the assumption A1. For instance, while $\mu_1$ is known to be small in most of the models of low rank matrices that have
been analyzed, no one has shown that an assumption of the form A1 is necessary for completion. Nonetheless, all
prior results on matrix completion have imposed an assumption like A1 [4, 7, 18], and it would be interesting to see if
it can be removed as a requirement, or if it is somehow necessary. 

Surprisingly, the simplicity of the argument presented here mostly arises from the abandonment of Bernoulli 
sampling in favor of sampling with replacement. It would be of interest to review results investigating noise robustness 
of matrix completion [5, 19] or deconvolution of sparse and low rank matrices [8] to see if results can be improved 
by appealing to sampling with replacement. Furthermore, since much of the work on rank minimization and matrix 
completion borrows tools from the compressed sensing community, it is of interest to revisit this related body of work 
and to see if proofs can be simplified or bounds can be improved there as well. The noncommutative versions of 
Chernoff and Bernstein's Inequalities may be useful throughout machine learning and statistical signal processing,
and a fruitful line of inquiry would examine how to apply these tools from quantum information to the study of 
classical signals and systems. 
A Operator Chernoff Bounds

In this section, I present a proof of Theorem 3.2 and also provide new proofs of some probability bounds from quantum
information theory. To review, a symmetric matrix $A$ is positive semidefinite if all of its eigenvalues are nonnegative.
If $A$ and $B$ are positive semidefinite matrices, $A \preceq B$ means $B - A$ is positive semidefinite. For square matrices $A$,
the matrix exponential will be denoted $\exp(A)$ and is given by the power series

$$\exp(A) = \sum_{k=0}^{\infty} \frac{A^k}{k!}.$$

The following theorem is a generalization of Markov's inequality originally proven in [1]. My proof closely 
follows the standard proof of the traditional Markov inequality, and does not rely on discrete summations. 

Theorem A.1 (Operator Markov Inequality [1]) Let $X$ be a random positive semidefinite matrix and $A$ a fixed 
positive definite matrix. Then 

\[ \mathbb{P}[X \not\preceq A] \le \operatorname{Tr}\big(\mathbb{E}[X] A^{-1}\big) . \]

Proof Note that if $X \not\preceq A$, then $A^{-1/2} X A^{-1/2} \not\preceq I$, and hence $\|A^{-1/2} X A^{-1/2}\| > 1$. Let $I_{X \not\preceq A}$ denote the 
indicator of the event $X \not\preceq A$. Then $I_{X \not\preceq A} \le \operatorname{Tr}(A^{-1/2} X A^{-1/2})$ as the right hand side is always nonnegative, and, 
if the left hand side equals 1, the trace of the right hand side is at least the norm of the right hand side, which is 
greater than 1. Thus we have 

\[ \mathbb{P}[X \not\preceq A] = \mathbb{E}[I_{X \not\preceq A}] \le \mathbb{E}\big[\operatorname{Tr}(A^{-1/2} X A^{-1/2})\big] = \operatorname{Tr}\big(\mathbb{E}[X] A^{-1}\big) , \]

where the last equality follows from the linearity and cyclic properties of the trace. ■ 
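
The Operator Markov Inequality can be checked exactly on a small example. The sketch below is illustrative only: the two-point distribution for $X$ and the matrix $A = 3I$ are hypothetical choices that do not come from the paper, and the helper names are invented for this example. It uses exact rational arithmetic so that both sides of the bound can be compared without rounding.

```python
from fractions import Fraction as F

def is_psd_2x2(m):
    """A symmetric 2x2 matrix is PSD iff its trace and determinant are nonnegative."""
    (a, b), (_, c) = m
    return a + c >= 0 and a * c - b * b >= 0

def inverse_2x2(m):
    """Inverse of an invertible 2x2 matrix via the adjugate formula."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def subtract(p, q):
    return [[p[i][j] - q[i][j] for j in range(2)] for i in range(2)]

def trace_of_product(p, q):
    """Tr(PQ) without forming the product."""
    return sum(p[i][j] * q[j][i] for i in range(2) for j in range(2))

# Hypothetical distribution: X = [[2, 2], [2, 2]] with probability 1/10, else X = 0.
outcomes = [
    (F(1, 10), [[F(2), F(2)], [F(2), F(2)]]),
    (F(9, 10), [[F(0), F(0)], [F(0), F(0)]]),
]
A = [[F(3), F(0)], [F(0), F(3)]]  # fixed positive definite matrix (assumed)

# Left side: probability that A - X fails to be positive semidefinite.
prob_violation = sum(p for p, x in outcomes if not is_psd_2x2(subtract(A, x)))

# Right side: Tr(E[X] A^{-1}).
mean_X = [[sum(p * x[i][j] for p, x in outcomes) for j in range(2)] for i in range(2)]
markov_bound = trace_of_product(mean_X, inverse_2x2(A))

print(prob_violation, "<=", markov_bound)
```

Here the nonzero outcome has largest eigenvalue 4, which exceeds 3, so the event $X \not\preceq A$ has probability exactly 1/10, and the bound is nontrivial because the right hand side is below 1.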

Next I will derive a noncommutative version of the Chernoff bound. This was also proven in [1] for i.i.d. matrices. 
The version stated here is more general in that the random matrices need not be identically distributed, but the proof 
is essentially the same. 

Theorem A.2 (Noncommutative Chernoff Bound) Let $X_1, \ldots, X_n$ be independent symmetric random matrices in 
$\mathbb{R}^{d \times d}$. Let $A$ be an arbitrary symmetric matrix. Then for any invertible $d \times d$ matrix $T$, 

\[ \mathbb{P}\Big[\sum_{k=1}^{n} X_k \not\preceq nA\Big] \le d \prod_{k=1}^{n} \big\| \mathbb{E}[\exp(T X_k T^* - T A T^*)] \big\| . \]

Proof The proof relies on an estimate from statistical physics which is stated here without proof. 

Lemma A.3 (Golden-Thompson inequality [14, 27]) For any symmetric matrices $A$ and $B$, 

\[ \operatorname{Tr}(\exp(A + B)) \le \operatorname{Tr}((\exp A)(\exp B)) . \]

Much like the proof of the standard Chernoff bound, the theorem now follows from a long chain of inequalities. 



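\begin{align*}
\mathbb{P}\Big[\sum_{k=1}^{n} X_k \not\preceq nA\Big]
&= \mathbb{P}\Big[\sum_{k=1}^{n} (X_k - A) \not\preceq 0\Big] \\
&= \mathbb{P}\Big[\sum_{k=1}^{n} T(X_k - A)T^* \not\preceq 0\Big] \\
&= \mathbb{P}\Big[\exp\Big(\sum_{k=1}^{n} T(X_k - A)T^*\Big) \not\preceq I\Big] \\
&\le \operatorname{Tr}\, \mathbb{E}\Big[\exp\Big(\sum_{k=1}^{n} T(X_k - A)T^*\Big)\Big] \\
&= \mathbb{E}\Big[\operatorname{Tr} \exp\Big(\sum_{k=1}^{n} T(X_k - A)T^*\Big)\Big] \\
&\le \mathbb{E}\Big[\operatorname{Tr}\Big(\exp\Big(\sum_{k=1}^{n-1} T(X_k - A)T^*\Big) \exp\big(T(X_n - A)T^*\big)\Big)\Big] \\
&\le \mathbb{E}_{1,\ldots,n-1}\Big[\operatorname{Tr}\Big(\exp\Big(\sum_{k=1}^{n-1} T(X_k - A)T^*\Big)\, \mathbb{E}\big[\exp\big(T(X_n - A)T^*\big)\big]\Big)\Big] \\
&\le \big\|\mathbb{E}[\exp(T(X_n - A)T^*)]\big\| \; \mathbb{E}_{1,\ldots,n-1}\Big[\operatorname{Tr} \exp\Big(\sum_{k=1}^{n-1} T(X_k - A)T^*\Big)\Big] \\
&\le \prod_{k=2}^{n} \big\|\mathbb{E}[\exp(T(X_k - A)T^*)]\big\| \; \mathbb{E}\big[\operatorname{Tr}\big(\exp(T(X_1 - A)T^*)\big)\big] \\
&\le d \prod_{k=1}^{n} \big\|\mathbb{E}[\exp(T(X_k - A)T^*)]\big\| .
\end{align*}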

Here, the first three lines follow from standard properties of the semidefinite ordering. The fourth line invokes the 
Operator Markov Inequality. The fifth line uses the linearity of the trace. The sixth line follows from the 
Golden-Thompson inequality. The seventh line follows from independence of the $X_k$. The eighth line follows because 
for positive definite matrices $\operatorname{Tr}(AB) \le \operatorname{Tr}(A)\|B\|$. 
This is just another statement of the duality between the nuclear and operator norms. The ninth line iteratively repeats 
the previous two steps. The final line follows because for a positive definite matrix $A$, $\operatorname{Tr}(A)$ is the sum of the 
eigenvalues of $A$, and all of the eigenvalues are at most $\|A\|$. ■ 
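
The Golden-Thompson inequality used in the proof above is stated without proof, so a quick numerical check may help build intuition. The sketch below is an illustration under stated assumptions, not part of the paper's argument: it is restricted to symmetric 2x2 matrices, where the exponential has a closed form, and the test matrices and function names are hypothetical.

```python
import math

def expm_sym_2x2(m):
    """Exponential of a symmetric 2x2 matrix [[a, b], [b, c]] in closed form.

    Writing M = mean*I + N with N traceless, N^2 = r^2 * I, so
    exp(M) = e^mean * (cosh(r) * I + (sinh(r) / r) * N).
    """
    (a, b), (_, c) = m
    mean = (a + c) / 2.0
    half_diff = (a - c) / 2.0
    r = math.hypot(half_diff, b)           # eigenvalues of M are mean +/- r
    ratio = math.sinh(r) / r if r > 0 else 1.0
    scale = math.exp(mean)
    off = scale * ratio * b
    return [[scale * (math.cosh(r) + ratio * half_diff), off],
            [off, scale * (math.cosh(r) - ratio * half_diff)]]

def add(p, q):
    return [[p[i][j] + q[i][j] for j in range(2)] for i in range(2)]

def trace(m):
    return m[0][0] + m[1][1]

def trace_of_product(p, q):
    return sum(p[i][j] * q[j][i] for i in range(2) for j in range(2))

def golden_thompson_sides(a, b):
    """Return (Tr exp(A + B), Tr(exp(A) exp(B)))."""
    lhs = trace(expm_sym_2x2(add(a, b)))
    rhs = trace_of_product(expm_sym_2x2(a), expm_sym_2x2(b))
    return lhs, rhs

# Hypothetical non-commuting pair (chosen only for illustration).
A = [[1.0, 0.0], [0.0, -1.0]]
B = [[0.0, 1.0], [1.0, 0.0]]

lhs, rhs = golden_thompson_sides(A, B)
print(lhs <= rhs)
```

For this pair the two sides work out to $2\cosh(\sqrt{2})$ and $2\cosh^2(1)$ respectively, so the inequality is strict. When the two matrices commute, for instance when `B` is replaced by `A`, the two traces coincide.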

Let us now turn to proving the Noncommutative Bernstein Inequality presented in Section 3. The authors in [16] 
proposed a similar inequality for symmetric i.i.d. random matrices with a slightly worse constant. The proof here is 
more general and follows the standard derivation of Bernstein's inequality. 
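
Proof [of Theorem 3.2] Set 

\[ Y_k = \begin{bmatrix} 0 & X_k \\ X_k^* & 0 \end{bmatrix} . \]

Then $Y_k$ are symmetric random variables, and for all $k$ 

\[ \big\|\mathbb{E}[Y_k^2]\big\| = \left\| \mathbb{E} \begin{bmatrix} X_k X_k^* & 0 \\ 0 & X_k^* X_k \end{bmatrix} \right\| = \max\big\{ \|\mathbb{E}[X_k X_k^*]\|, \|\mathbb{E}[X_k^* X_k]\| \big\} = \rho_k^2 . \]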
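
The block structure of $Y_k^2$ used above can be verified directly. The following sketch is only an illustration: the integer matrix `X` is a hypothetical example, the helper names are invented here, and real matrices are assumed so that the adjoint is the transpose.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(p, q):
    qt = transpose(q)
    return [[sum(x * y for x, y in zip(row, col)) for col in qt] for row in p]

def dilation(x):
    """Symmetric dilation [[0, X], [X^T, 0]] of a d1 x d2 matrix X."""
    d1, d2 = len(x), len(x[0])
    top = [[0] * d1 + list(row) for row in x]
    bottom = [list(row) + [0] * d2 for row in transpose(x)]
    return top + bottom

def block_diag(p, q):
    """Block diagonal matrix diag(P, Q) for square P and Q."""
    n, m = len(p), len(q)
    return [list(row) + [0] * m for row in p] + [[0] * n + list(row) for row in q]

# Hypothetical 2 x 3 matrix with integer entries, so all arithmetic is exact.
X = [[1, 2, 0],
     [0, -1, 3]]

Y = dilation(X)
Y_squared = matmul(Y, Y)
expected = block_diag(matmul(X, transpose(X)), matmul(transpose(X), X))

print(Y_squared == expected)
```

Because $Y^2$ is block diagonal with blocks $X X^*$ and $X^* X$, its eigenvalues are the squared singular values of $X$ (padded with zeros), which is why the norm of $\mathbb{E}[Y_k^2]$ reduces to a maximum over the two blocks.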

Moreover, the maximum singular value of $\sum_{k=1}^{L} X_k$ is equal to the maximum eigenvalue of $\sum_{k=1}^{L} Y_k$. By 
Theorem A.2, we have for all $\lambda > 0$ 

\[ \mathbb{P}\Big[\Big\|\sum_{k=1}^{L} X_k\Big\| > Lt\Big] = \mathbb{P}\Big[\sum_{k=1}^{L} Y_k \not\preceq Lt I\Big] \le (d_1 + d_2) \exp(-L\lambda t) \prod_{k=1}^{L} \big\|\mathbb{E}[\exp(\lambda Y_k)]\big\| . \]



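For each $k$, let $Y_k = U_k \Lambda_k U_k^*$ be an eigenvalue decomposition, where $\Lambda_k$ is the diagonal matrix of the eigenvalues 
of $Y_k$. In turn, it follows that for $s \ge 0$ 

\[ -M^s Y_k^2 \preceq -U_k M^s \Lambda_k^2 U_k^* \preceq U_k \Lambda_k^{2+s} U_k^* = Y_k^{2+s} \preceq U_k M^s \Lambda_k^2 U_k^* \preceq M^s Y_k^2 , \]

which then implies 

\[ \big\|\mathbb{E}[Y_k^{s+2}]\big\| \le M^s \big\|\mathbb{E}[Y_k^2]\big\| . \tag{A.1} \]

For fixed $k$, we have 

\begin{align*}
\big\|\mathbb{E}[\exp(\lambda Y_k)]\big\|
&\le 1 + \sum_{j=2}^{\infty} \frac{\lambda^j}{j!} \big\|\mathbb{E}[Y_k^j]\big\| \\
&\le 1 + \sum_{j=2}^{\infty} \frac{\lambda^j}{j!} \rho_k^2 M^{j-2} \\
&= 1 + \frac{\rho_k^2}{M^2} \sum_{j=2}^{\infty} \frac{\lambda^j}{j!} M^j = 1 + \frac{\rho_k^2}{M^2} \big(\exp(\lambda M) - 1 - \lambda M\big) \\
&\le \exp\Big(\frac{\rho_k^2}{M^2} \big(\exp(\lambda M) - 1 - \lambda M\big)\Big) .
\end{align*}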



The first inequality follows from the triangle inequality and the fact that $\mathbb{E}[Y_k] = 0$, the second inequality follows 
from (A.1), and the final inequality follows from the fact that $1 + x \le \exp(x)$ for all $x$. Putting this together gives 



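\[ \mathbb{P}\Big[\Big\|\sum_{k=1}^{L} X_k\Big\| > Lt\Big] \le (d_1 + d_2) \exp\Big(-\lambda L t + \frac{\sum_{k=1}^{L} \rho_k^2}{M^2} \big(\exp(\lambda M) - 1 - \lambda M\big)\Big) . \]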



This final expression is now just a real number, and only has to be minimized as a function of $\lambda$. The theorem now 
follows by algebraic manipulation: the right hand side is minimized by setting $\lambda = \frac{1}{M} \log\Big(1 + \frac{tLM}{\sum_{k=1}^{L} \rho_k^2}\Big)$, and then basic 
approximations can be employed to complete the argument (see, for example [23], lectures 4 and 5). ■ 
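
The last step can be sanity-checked numerically. Setting the derivative of the exponent $-\lambda L t + \frac{\sum_k \rho_k^2}{M^2}(\exp(\lambda M) - 1 - \lambda M)$ to zero gives $\exp(\lambda M) = 1 + tLM / \sum_k \rho_k^2$, which is the stated choice of $\lambda$. The sketch below evaluates the exponent at that point and compares it with a Bernstein-type exponent $-\frac{\tau^2/2}{\sum_k \rho_k^2 + M\tau/3}$ with $\tau = Lt$. That target form is an assumption about the statement of the theorem, which lies outside this appendix, and the numerical values and function names are hypothetical.

```python
import math

def chernoff_exponent(lam, t, L, M, rho_sq_sum):
    """Exponent -lam*L*t + (sum rho_k^2 / M^2) * (exp(lam*M) - 1 - lam*M)."""
    return -lam * L * t + (rho_sq_sum / M**2) * (math.exp(lam * M) - 1.0 - lam * M)

def optimal_lambda(t, L, M, rho_sq_sum):
    """Stationary point of the exponent: (1/M) * log(1 + t*L*M / sum rho_k^2)."""
    return math.log(1.0 + t * L * M / rho_sq_sum) / M

def bernstein_exponent(t, L, M, rho_sq_sum):
    """Assumed target form: -(tau^2 / 2) / (sum rho_k^2 + M*tau/3), tau = L*t."""
    tau = L * t
    return -(tau**2 / 2.0) / (rho_sq_sum + M * tau / 3.0)

# Hypothetical values chosen only for illustration.
t, L, M, rho_sq_sum = 0.1, 100, 1.0, 25.0

lam_star = optimal_lambda(t, L, M, rho_sq_sum)
best = chernoff_exponent(lam_star, t, L, M, rho_sq_sum)

# The exponent is convex in lam, so nearby values should not do better.
nearby = [chernoff_exponent(lam_star + step, t, L, M, rho_sq_sum)
          for step in (-0.05, 0.05)]
target = bernstein_exponent(t, L, M, rho_sq_sum)

print(best, nearby, target)
```

With $u = tLM / \sum_k \rho_k^2$, the optimized exponent equals $-\frac{\sum_k \rho_k^2}{M^2}\big((1+u)\log(1+u) - u\big)$, and the elementary bound $(1+u)\log(1+u) - u \ge \frac{u^2}{2(1 + u/3)}$ is one standard way to pass from it to the Bernstein-type form.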